feat(native): Create a runtime metric for worker uptime to be used for restart alerts#26979
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideAdds a new periodic metric in the C++ Presto worker that reports process uptime in seconds, aligning with an existing Java worker metric for restart alerting. Sequence diagram for worker uptime metric recordingsequenceDiagram
participant PrestoServer
participant PeriodicTaskManager
participant UptimeTask
participant MetricRegistry
PrestoServer->>PeriodicTaskManager: addTask(worker_runtime_uptime_secs, interval=2_000_000)
Note over PrestoServer,PeriodicTaskManager: Task captures start_ time from PrestoServer
loop Every 2 seconds
PeriodicTaskManager->>UptimeTask: invoke()
UptimeTask->>UptimeTask: seconds = now() - start_
UptimeTask->>MetricRegistry: RECORD_METRIC_VALUE(kCounterWorkerRuntimeUptimeSecs, seconds)
end
Class diagram for uptime metric and countersclassDiagram
class PrestoServer {
- start_ std::chrono::steady_clock::time_point
- periodicTaskManager_ PeriodicTaskManager*
+ addServerPeriodicTasks() void
}
class PeriodicTaskManager {
+ addTask(taskFunction, intervalMicros int64_t, name std::string) void
}
class Counters {
}
class MetricRegistry {
+ registerPrestoMetrics() void
+ DEFINE_METRIC(counterName, statType) void
+ RECORD_METRIC_VALUE(counterName, value) void
}
class WorkerRuntimeUptimeMetric {
+ kCounterWorkerRuntimeUptimeSecs folly::StringPiece
}
PrestoServer --> PeriodicTaskManager : uses
PrestoServer --> WorkerRuntimeUptimeMetric : records
WorkerRuntimeUptimeMetric --> MetricRegistry : defined_in
MetricRegistry <|-- Counters
%% Constants represented as static members
class PartitionedOutputBufferMetrics {
+ kCounterPartitionedOutputBufferGetDataLatencyMs std::string_view
}
PartitionedOutputBufferMetrics --> MetricRegistry : defined_in
WorkerRuntimeUptimeMetric --> PartitionedOutputBufferMetrics : similar_metric_group
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
997e20d to
4a52a1a
Compare
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The new uptime metric is registered as StatType::AVG, but since this is a monotonically increasing uptime value it would be more accurate to use a gauge-like type (e.g., LAST/MAX) rather than an average.
- In Counters.h the new kCounterWorkerRuntimeUptimeSecs constant uses folly::StringPiece while surrounding counters use std::string_view; consider using the same type for consistency unless there is a specific reason to differ.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new uptime metric is registered as StatType::AVG, but since this is a monotonically increasing uptime value it would be more accurate to use a gauge-like type (e.g., LAST/MAX) rather than an average.
- In Counters.h the new kCounterWorkerRuntimeUptimeSecs constant uses folly::StringPiece while surrounding counters use std::string_view; consider using the same type for consistency unless there is a specific reason to differ.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
4a52a1a to
0945696
Compare
Summary:
Metrics change only.
Added a runtime metric for worker uptime called: presto_cpp_worker_runtime_uptime_secs
The metrics is present in java worker but is missing in c++ worker, causing the missing alert for c++ workers.
Test Plan:
Deployed to staging cluster and checked the metrics status.{F1247554645}
Impact -> no impact since this runs in a background thread
Reviewers: #ldap_velox-core, jay.narale
Reviewed By: #ldap_velox-core, jay.narale
JIRA Issues: PRESTO-9381
Differential Revision: https://code.uberinternal.com/D20790631
0945696 to
72dcc86
Compare
|
LGTM, cc: @aditi-pandit can you please help review |
8bd1b03 to
f542457
Compare
|
Hi @jja725 : I'm fine with your code change. Though curious as you mention this broke existing Java code. Can you point that code to me ? Or was it Uber internal ? |
It's just uber internal |
|
Are these runtime metrics documented anywhere? I didn't notice when I looked. If they are not, should they be? |
I don't think there's doc regarding these metrics. Probably we need another PR for the document |
Yes, I agree, given this PR is approved and ready to merge there's no need to hold it up now. I will open an issue to track the doc improvement needed. |
Description
Added a runtime metric
presto_cpp.worker_runtime_uptime_secsto track how long a C++ worker has been running since startup. This metric is recorded every 2 seconds via a periodic task.Motivation and Context
The worker uptime metric exists in the Java worker but was missing in the C++ (Presto native) worker.
Impact
presto_cpp.worker_runtime_uptime_secs- reports the number of seconds since the worker process startedTest Plan
Contributor checklist
Release Notes